Text Categorization of Low Quality Images
نویسندگان
چکیده
Categorization of text images into content oriented classes would be a useful capability in a variety of document handling systems Many methods can be used to cat egorize texts once their words are known but OCR can garble a large proportion of words particularly when low quality images are used Despite this we show for one data set that fax quality images can be cat egorized with nearly the same accuracy as the original text Further the categoriza tion system can be trained on noisy OCR output without need for the true text of any image or for editing of OCR output The use of a vector space classi er and train ing method robust to large feature sets com bined with discarding of low frequency OCR output strings are the key to our approach
منابع مشابه
Image spam filtering using textual and visual information
In this paper we focus on the so-called image spam, which consists in embedding the spam message into images attached to e-mails to circumvent statistical techniques based on the analysis of body text of e-mails (like the “bayesian filters”), and in applying content obscuring techniques to such images to make them unreadable by standard OCR systems without compromising human readability. We arg...
متن کاملText categorization using character shape codes
Text categorization in the form of topic identification is a capability of current interest. This paper is concerned with categorization of electronic document images. Previous work on the categorization of document images has relied on Optical Character Recognition (OCR) to provide the transformation between the image domain and a domain where pattern recognition techniques are more readily ap...
متن کاملRobust Statistical Techniques for the Categorization of Images Using Associated Text
Robust Statistical Techniques for the Categorization of Images Using Associated Text
متن کاملImprovement of generative adversarial networks for automatic text-to-image generation
This research is related to the use of deep learning tools and image processing technology in the automatic generation of images from text. Previous researches have used one sentence to produce images. In this research, a memory-based hierarchical model is presented that uses three different descriptions that are presented in the form of sentences to produce and improve the image. The proposed ...
متن کاملImproving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...
متن کامل